Introduction:¶

Welcome to my exciting journey through the world of finance education on Udemy! In this exploration, we'll delve into a rich dataset containing information about various finance courses offered on the popular online learning platform. By leveraging the powerful tools of Python such as pandas, Matplotlib, and Plotly, we aim to uncover valuable insights and trends within this dataset.

Data Exploration:¶

  1. Data Cleaning:

    • Load the Udemy finance course dataset into a pandas DataFrame.
    • Identify and handle any missing values or inconsistencies in the data.
    • Remove outliers to ensure the integrity of our analysis.
  2. Exploratory Data Analysis:

    • Investigate the distribution of course offerings across different categories and topics within finance.
    • Analyze the popularity of courses based on factors such as average ratings, number of reviews, and current pricing.
    • Explore the relationship between course characteristics and subscriber engagement.
In [1]:
!pip install langdetect
!pip install language_data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import langdetect
import langcodes
Requirement already satisfied: langdetect in c:\users\mehul\anaconda3\lib\site-packages (1.0.9)
Requirement already satisfied: six in c:\users\mehul\anaconda3\lib\site-packages (from langdetect) (1.16.0)
Requirement already satisfied: language_data in c:\users\mehul\anaconda3\lib\site-packages (1.2.0)
Requirement already satisfied: marisa-trie>=0.7.7 in c:\users\mehul\anaconda3\lib\site-packages (from language_data) (1.1.0)
Requirement already satisfied: setuptools in c:\users\mehul\anaconda3\lib\site-packages (from marisa-trie>=0.7.7->language_data) (68.2.2)
In [2]:
df_orignal = pd.read_csv("udemy.csv")
df_orignal.sample(5)
Out[2]:
id title url is_paid num_subscribers avg_rating avg_rating_recent rating num_reviews is_wishlisted num_published_lectures num_published_practice_tests created published_time discount_price__amount discount_price__currency discount_price__price_string price_detail__amount price_detail__currency price_detail__price_string
7159 1817268 Start Your Own Business with the 9-5 Dropout A... /course/start-your-own-business-with-the-9-5-d... True 1671 4.45 4.38060 4.38060 11 False 33 0 2018-07-24T13:32:58Z 2018-08-29T16:00:44Z 455.0 INR ₹455 6400.0 INR ₹6,400
3117 1697090 Microsoft Project 2016 by Abu Bakar - 20 PDU's /course/msproject2016/ True 363 4.50 4.48216 4.48216 70 False 25 0 2018-05-15T20:11:39Z 2018-11-16T20:04:15Z NaN NaN NaN 12800.0 INR ₹12,800
12874 3330606 Besserer Umgang mit Geld /course/besserer-umgang-mit-geld/ True 142 4.25 4.30303 4.30303 2 False 24 0 2020-07-14T09:26:20Z 2020-07-15T07:24:43Z 455.0 INR ₹455 1280.0 INR ₹1,280
8988 2382421 How to Massively Increase Course Sales - Chang... /course/online-course-sales-one-thing/ True 3861 4.75 4.69269 4.69269 4 False 14 0 2019-05-24T12:51:53Z 2019-05-31T17:33:33Z 455.0 INR ₹455 8640.0 INR ₹8,640
2638 441848 Business Writing That Gets Results /course/business-writing-that-gets-results/ True 2151 3.75 3.76465 3.76465 91 False 27 0 2015-03-08T10:56:54Z 2015-04-05T16:25:30Z 455.0 INR ₹455 1280.0 INR ₹1,280
In [3]:
df = df_orignal #creating a copy of our dataset
In [4]:
print("Price Detail Currencies:", df['price_detail__currency'].unique())

print("\nValue Counts of Price Detail Currencies:", df['price_detail__currency'].value_counts())

print("\nDiscount Price Currencies:", df['discount_price__currency'].unique())

print("\nValue Counts of Discount Price Currencies:", df['discount_price__currency'].value_counts())

print("\nValue Counts of wishlisted courses: ", df["is_wishlisted"].value_counts())

print("\nValue Counts of practice tests: ", df.num_published_practice_tests.value_counts())
Price Detail Currencies: ['INR' nan]

Value Counts of Price Detail Currencies: price_detail__currency
INR    13111
Name: count, dtype: int64

Discount Price Currencies: ['INR' nan]

Value Counts of Discount Price Currencies: discount_price__currency
INR    12205
Name: count, dtype: int64

Value Counts of wishlisted courses:  is_wishlisted
False    13608
Name: count, dtype: int64

Value Counts of practice tests:  num_published_practice_tests
0    13008
2      214
1      202
6       67
5       43
3       39
4       35
Name: count, dtype: int64
In [5]:
df[["discount_price__amount", "discount_price__price_string", "price_detail__amount", "price_detail__price_string"]].sample(5)
Out[5]:
discount_price__amount discount_price__price_string price_detail__amount price_detail__price_string
12735 NaN NaN 11520.0 ₹11,520
2437 455.0 ₹455 5760.0 ₹5,760
4525 455.0 ₹455 8640.0 ₹8,640
7476 455.0 ₹455 4800.0 ₹4,800
10445 455.0 ₹455 3200.0 ₹3,200

We observed that the majority of the data pertains to either Indian Rupees (INR) or contains no value (NaN). Since these columns offer little insight into our analysis, we will remove them. Similarly, columns like "id" and "url" do not contribute significantly to our understanding of the data and will be dropped as well.

We observe that the string fields serve as symbolic representations of their corresponding discount_price and price_detail amounts, respectively. Furthermore, these string fields exhibit an equal number of NaN values. Hence, it is appropriate to remove them entirely from the dataset.

Since there are no courses that are wishlisted in this dataset then let's just drop this field. Also, courses that are free have some current price which should have been ideally zero.

Most finance courses available on udemy dont have tests for their subscribers to practice.

Moreover, the "created" and "published_time" fields contain datetime format data rather than the object data type. We'll convert these fields to the appropriate datetime format.

In [6]:
df = df.drop(columns = ["id", "url", "price_detail__currency", "discount_price__currency", "discount_price__price_string","price_detail__price_string", "is_wishlisted"])
df.created = pd.to_datetime(df.created)
df.published_time = pd.to_datetime(df.published_time)
df.rename(columns={'discount_price__amount': 'current_price', 'price_detail__amount': "orignal_price", 'num_subscribers':"subscribers"}, inplace=True)
df.loc[df['is_paid'] == False, 'current_price'] = 0
In [7]:
df.dtypes
Out[7]:
title                                        object
is_paid                                        bool
subscribers                                   int64
avg_rating                                  float64
avg_rating_recent                           float64
rating                                      float64
num_reviews                                   int64
num_published_lectures                        int64
num_published_practice_tests                  int64
created                         datetime64[ns, UTC]
published_time                  datetime64[ns, UTC]
current_price                               float64
orignal_price                               float64
dtype: object
In [8]:
df.isnull().sum()
Out[8]:
title                             0
is_paid                           0
subscribers                       0
avg_rating                        0
avg_rating_recent                 0
rating                            0
num_reviews                       0
num_published_lectures            0
num_published_practice_tests      0
created                           0
published_time                    0
current_price                   907
orignal_price                   497
dtype: int64
In [9]:
df = df.interpolate()
df.isna().sum()
C:\Users\Mehul\AppData\Local\Temp\ipykernel_9544\2690831041.py:1: FutureWarning: DataFrame.interpolate with object dtype is deprecated and will raise in a future version. Call obj.infer_objects(copy=False) before interpolating instead.
  df = df.interpolate()
Out[9]:
title                           0
is_paid                         0
subscribers                     0
avg_rating                      0
avg_rating_recent               0
rating                          0
num_reviews                     0
num_published_lectures          0
num_published_practice_tests    0
created                         0
published_time                  0
current_price                   0
orignal_price                   0
dtype: int64
In [10]:
df.describe()
Out[10]:
subscribers avg_rating avg_rating_recent rating num_reviews num_published_lectures num_published_practice_tests current_price orignal_price
count 13608.000000 13608.000000 13608.000000 13608.000000 13608.000000 13608.000000 13608.000000 13608.000000 13608.000000
mean 2847.010435 3.923293 3.912242 3.912242 243.169827 32.224794 0.110523 476.277741 4719.647266
std 9437.865634 1.031304 1.039237 1.039237 1580.965895 42.766911 0.623501 276.566197 3206.448904
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1280.000000
25% 62.000000 3.800000 3.787315 3.787315 7.000000 12.000000 0.000000 455.000000 1600.000000
50% 533.000000 4.194440 4.181735 4.181735 24.000000 21.000000 0.000000 455.000000 3200.000000
75% 2279.500000 4.450000 4.452105 4.452105 87.000000 37.000000 0.000000 455.000000 8640.000000
max 374836.000000 5.000000 5.000000 5.000000 78006.000000 699.000000 6.000000 3200.000000 12800.000000

Content Depth Analysis: Exploring the relationship between the number of published lectures and course ratings. Determining whether courses with a higher number of lectures tend to have better ratings and higher subscriber engagement.

In [11]:
df[["avg_rating", "avg_rating_recent", "subscribers", "num_reviews", "num_published_lectures"]].corr()
Out[11]:
avg_rating avg_rating_recent subscribers num_reviews num_published_lectures
avg_rating 1.000000 0.989222 0.082100 0.068631 0.116836
avg_rating_recent 0.989222 1.000000 0.084198 0.069347 0.113875
subscribers 0.082100 0.084198 1.000000 0.784190 0.211672
num_reviews 0.068631 0.069347 0.784190 1.000000 0.179063
num_published_lectures 0.116836 0.113875 0.211672 0.179063 1.000000

There is a very high correlation between different rating metrics but it seems that they do not relate to number of reviews that much. Similarly, number of subscribers is highly correlated with number of reviews.

Now let's create some new fields.

In [12]:
# Defining the bins and labels for the categories
bins = [0, 2.5, 3.9, 4.3, 4.8, 5]
labels = ['Poor', 'Below Average', 'Above Average', 'Good', 'Excellent']
# Creating a new column to categorize courses based on avg-rating
df['rating_category'] = pd.cut(df['avg_rating'], bins=bins, labels=labels, right=False)
df[["avg_rating", "rating_category"]].sample(10)
Out[12]:
avg_rating rating_category
9013 5.00000 NaN
6195 4.30000 Good
2043 3.93182 Above Average
10180 3.75000 Below Average
11008 5.00000 NaN
13340 0.00000 Poor
4984 4.30000 Good
12133 3.80000 Below Average
11061 3.00000 Below Average
2106 4.34328 Good
In [13]:
df['total_sales'] = df['subscribers'] * df['current_price']
paid_courses = df[df['is_paid'] == True]
free_courses = df[df["is_paid"] == False]

We can see there are many similar courses so let's fitler them out and keep our courses distinct.

In [14]:
duplicates = df.duplicated(subset=['title'])
print("Number of Duplicate titles:", df[duplicates]['title'].shape)
Number of Duplicate titles: (45,)
In [15]:
df = df[~duplicates]
#checking again to see if it worked
duplicates = df.duplicated(subset=['title'])
print("Number of Duplicate titles:", df[duplicates]['title'].shape)
Number of Duplicate titles: (0,)

Now our data is consistent. Next, let's visualise how many finance courses are paid.

Visualization:¶

  1. Data Visualization:

    • Utilize Matplotlib and Plotly to create engaging visualizations that highlight key insights from our data exploration.
    • Generate bar charts, scatter plots, and box plots to illustrate trends and patterns in course popularity, ratings, and pricing.
    • Ensure that the visualizations are clear, informative, and visually appealing.
  2. Interactive Visualization:

    • Harness the interactive capabilities of Plotly to create dynamic dashboards and plots.
    • Allow users to interactively explore the dataset, filter courses based on their preferences, and gain deeper insights into finance education trends on Udemy.
In [16]:
fig = go.Figure(data=go.Violin(y=paid_courses['current_price'],x = paid_courses["rating_category"],
                                box_visible=True,
                                line_color='blue',
                                meanline_visible=True,
                                fillcolor='lightblue',
                                opacity=0.6,
                                hoverinfo='y+name'))
fig.update_layout(
    title='Distribution of Current Prices Across Rating Categories',
    xaxis_title='Rating Category',
    yaxis_title='Current Price',
    yaxis=dict(range=[400, 700]),
    height=400,
    width=1000,
)

fig.show()
In [17]:
plt.figure(figsize = (4,3))
# Countplot for paid vs. free courses
sns.countplot(data=df, x='is_paid', width=0.5)
plt.title('Paid vs Free Courses')
plt.xlabel('Course Type')
plt.xticks([0, 1], labels = ['Free', 'Paid'])
plt.show()
No description has been provided for this image
In [18]:
# Course Creation or Publication Over Time
# Aggregate the data by year and count the number of courses created or published each year
courses_over_time = df.groupby(df['created'].dt.year)['title'].count()

# Create a line plot to visualize the trend of course creation or publication over time
fig_time_series = go.Figure(data=go.Scatter(x=courses_over_time.index, y=courses_over_time.values,
                                            mode='lines+markers', marker=dict(color='blue'),
                                            name='Courses Created or Published'))

fig_time_series.update_layout(title='Finance Courses Created or Published Over Time',
                              xaxis_title='Year', yaxis_title='Number of Courses',
                              height=500, width=900)
fig_time_series.show()
In [19]:
fig = go.Figure(go.Violin(
    y=df['num_published_lectures'],
    box_visible=True,
    line_color='blue',
    meanline_visible=True,
    fillcolor='lightblue',
    opacity=0.6
))

# Update layout
fig.update_layout(
    title='Distribution of Number of Published Lectures',
    yaxis_title='Number of Published Lectures',
    yaxis=dict(range=[0, 100]),
    height=500,
    width=800
)

# Show plot
fig.show()
In [20]:
# Distribution of Time to Publication
# Calculate the time taken from course creation to publication (in days)
df['time_to_publication'] = (df['published_time'] - df['created']).dt.days

# Create a violin plot to visualize the distribution of time to publication
fig_time_to_publication_violin = go.Figure(data=go.Violin(y=df['time_to_publication'],
                                                          box_visible=True,
                                                          line_color='blue',
                                                          meanline_visible=True,
                                                          fillcolor='lightblue',
                                                          opacity=0.6,
                                                          hoverinfo='y+name'))


fig_time_to_publication_violin.update_layout(title='Distribution of Time to Publication for Finance Courses',
                                              yaxis_title='Time to Publication (Days)',
                                              yaxis=dict(range=[-1, 200]),
                                              height=500, width=900)

fig_time_to_publication_violin.show()
In [21]:
top_paid_courses = paid_courses.nlargest(5, 'subscribers')
top_free_courses = free_courses.nlargest(5, 'subscribers')

fig_paid = go.Figure(go.Bar(
    x=top_paid_courses['subscribers'],
    y=top_paid_courses['title'],
    orientation='h',
    marker=dict(color='crimson')
))
fig_paid.update_layout(
    title='Top 5 Most Subscribed Paid Courses',
    xaxis_title='Subscribers',
    yaxis_title=None,
    height=400,
    width=1100
)
fig_free = go.Figure(go.Bar(
    x=top_free_courses['subscribers'],
    y=top_free_courses['title'],
    orientation='h',
    marker=dict(color='olive')
))
fig_free.update_layout(
    title='Top 5 Most Subscribed Free Courses',
    xaxis_title='Subscribers',
    yaxis_title=None,
    height=400,
    width=1100
)

# Show plots
fig_paid.show()
fig_free.show()
In [22]:
# Sort the DataFrame based on total sales in descending order and select top 7 courses
top_7_courses = paid_courses.nlargest(7, 'total_sales')

# Create a horizontal bar chart to visualize the top 7 courses with the highest sales
fig = go.Figure(go.Bar(
    x=top_7_courses['total_sales'],
    y=top_7_courses['title'],
    orientation='h',
    marker=dict(color='crimson')
))

# Update layout
fig.update_layout(
    title='Top 7 Courses with the Highest Sales on Udemy',
    xaxis_title='Total Sales (INR)',
    yaxis_title=None,
    height=400,
    width=1200,
    bargap=0.2
)

# Show plot
fig.show()
In [23]:
fig = px.scatter(df, x='avg_rating', y='num_reviews', color='rating_category', opacity=0.7,
                 title='Average Rating vs. Number of Reviews',
                 labels={'avg_rating': 'Average Rating', 'num_reviews': 'Number of Reviews'},
                 hover_name='title', hover_data={'avg_rating': True, 'num_reviews': True})

# Customize the layout
fig.update_layout(title_x=0.5, height=400, width=1000,
                  xaxis=dict(showgrid=True),
                  yaxis=dict(showgrid=True))
fig.show()
C:\Users\Mehul\anaconda3\Lib\site-packages\plotly\express\_core.py:1958: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

Let's visualise courses based on this new field "rating_category"

In [24]:
# Count the number of courses in each rating category
rating_counts = df['rating_category'].value_counts()

# Create a custom color scale
colors = px.colors.qualitative.Set1

# Create a horizontal bar chart
fig = go.Figure(go.Bar(
    x=rating_counts.values,
    y=rating_counts.index,
    orientation='h',
    marker=dict(color=colors[:len(rating_counts)]),
))

# Update layout
fig.update_layout(
    title='Distribution of Courses by Rating Category',
    xaxis_title='Count',
    yaxis_title='Course rating category',
    height=500,
    width=800
)

# Show plot
fig.show()
In [25]:
def create_scatter_plot(df, rating_categories, title):
    fig = go.Figure()
    for category in rating_categories:
        df_category = df[df['rating_category'] == category]
        correlation = df_category['num_reviews'].corr(df_category['avg_rating'])
        fig.add_trace(go.Scatter(x=df_category['num_reviews'], y=df_category['avg_rating'],
                                 mode='markers', marker=dict(size=8, opacity=0.7),
                                 name=f'{category} courses (Corr: {correlation:.2f})',
                                 text=[f'Corr: {correlation:.2f}' for _ in range(len(df_category))]))
    fig.update_layout(title=title, xaxis_title='Number of Reviews', yaxis_title='Average Rating',
                      height=400, width=900)
    fig.show()

create_scatter_plot(free_courses, free_courses['rating_category'].unique(),
                    'Impact of Reviews on Average Rating for Free courses')
create_scatter_plot(paid_courses, paid_courses['rating_category'].unique(),
                    'Impact of Reviews on Average Rating for Paid courses')

We can confidently say that number of reviews do not impact average ratings in any way across diferent categories.

In [26]:
# Function to detect language of a title
def detect_language(title):
    try:
        return langdetect.detect(title)
    except:
        return 'Unknown'

# Function to map short forms to full names
def map_language(short_form):
    try:
        return langcodes.Language.make(short_form).language_name().title()
    except langcodes.LanguageValueError:
        return 'Unknown'

# Apply language detection to title and mapping to language field
df['language'] = df['title'].apply(detect_language).apply(map_language)
In [27]:
language_counts = df['language'].value_counts().reset_index()
language_counts.columns = ['Language', 'Count']

language_counts = language_counts.sort_values(by='Count', ascending=False)
top_languages = language_counts.head(3)
other_languages_count = language_counts['Count'].sum() - top_languages['Count'].sum()

labels_pie = top_languages['Language'].tolist() + ['Others']
values_pie = top_languages['Count'].tolist() + [other_languages_count]

fig_pie = go.Figure(data=[go.Pie(labels=labels_pie, values=values_pie, hole=0.5)])

fig_pie.update_layout(
    title='Top 3 Languages in Finance Courses',
    width=600,
    height=400,
    showlegend=True
)

fig_pie.show()

language_counts_treemap = language_counts.iloc[3:]

fig_treemap = px.treemap(language_counts_treemap, path=['Language'], values='Count', title="Let's zoom on 'other'")

# Show treemap
fig_treemap.show()
In [ ]: